Distributed Autonomous Online Learning:
Regrets and Intrinsic Privacy-Preserving Properties

Feng Yan
Department of CS, Purdue University

Shreyas Sundaram
Department of ECE, University of Waterloo

S.V.N. Vishwanathan
Departments of Statistics and CS, Purdue University

Yuan Qi
Departments of CS and Statistics, Purdue University

February 7, 2011

Abstract 



Online learning has become increasingly popular for handling massive
data. The sequential nature of online learning, however, requires a
centralized learner to store data and update parameters. In this paper, we
consider online learning with distributed data sources. The autonomous
learners update local parameters based on local data sources and
periodically exchange information with a small subset of neighbors in a
communication network. We derive the regret bound for strongly convex
functions that generalizes the work by Ram et al. [2010] for convex
functions. More importantly, we show that our algorithm has intrinsic
privacy-preserving properties, and we prove the sufficient and necessary
conditions for privacy preservation in the network. These conditions imply
that for networks with greater-than-one connectivity, a malicious learner
cannot reconstruct the subgradients (and sensitive raw data) of other
learners, which makes our algorithm appealing in privacy-sensitive
applications.



1 Introduction 

Online learning has emerged as an attractive paradigm in machine learning
given the ever-increasing amounts of data being collected every day. It
efficiently reduces the training time by processing the data only once,
assuming that all the training data are available at a central location.
For many applications, however, this assumption is problematic. For
instance, sensor networks may be deployed in rain forests and collect data
autonomously. The cost of transmitting all the data to a central server can
be prohibitively high. Also, sharing sensitive data might lead to
information leakage and raise privacy concerns. For example, banks collect
credit information about their customers but might not share the data with
other financial institutions because of privacy concerns. Similarly,
privacy concerns might prevent sharing of patient records across hospitals.

Therefore it is desirable to conduct distributed learning in a fully
decentralized setting. Specifically, we treat individual computational
units (e.g., processors) in a network as autonomous learners. They learn
model parameters independently from their local data sources, and pass
estimation information to their neighbors in a communication network. By
doing so, distributed learning avoids sharing original, sensitive data with
others and storing data in a central location.

In this paper, we consider a general distributed autonomous online learning
algorithm to learn from fully decentralized data sources. We address two
important questions associated with this general algorithm. The first
question is how the distributed online learners perform compared with the
optimal learner chosen in hindsight. To this end we derive the regret bound
for strongly convex functions. Our work is closely related to the recent
work by Ram et al. [2010] and Nedic & Ozdaglar [2009]; the main difference
lies in our analysis for strongly convex functions, which naturally extends
the results of Ram et al. [2010].



The second question is how the topology of the computational network
affects privacy preservation. To answer this question, we draw ideas from
modern control theory to model the distributed online learning algorithm as
a structured linear time-invariant system, and we establish theorems on
necessary and sufficient conditions under which a malicious learner can
reconstruct the subgradients of other learners at other locations. Based on
these conditions, we conclude that for most communication topologies,
namely those with connectivity greater than one, our algorithm inherently
prevents the reconstruction of the subgradients at other locations,
therefore avoiding information leakage. Previous works on
privacy-preserving learning mostly alter the original learning algorithms
by patching in cryptographic tools, such as secure multi-party computation
[Sakuma & Arai 2010; Kearns et al. 2007] and randomization [Chaudhuri &
Monteleoni 2009], or data aggregation [Rüping 2010; Avidan & Butman 2007].
By contrast, our privacy-preserving properties are intrinsic in the sense
that they do not require any modifications to the algorithm but are solely
determined by the communication network topology of the distributed
learners.
The main contributions of this paper include: 

• We present a distributed autonomous online learning algorithm that
  computes local subgradients and shares parameter vectors between nodes in
  a communication network. We derive its regret bounds for strongly convex
  (hence convex) functions.

• We use results from modern control theory to show the connection between
  the reconstructability of local subgradients and the topology of the
  communication network, which implies privacy preservation of local data
  for well-chosen communication networks.



2 Preliminaries 

Notation: Lower case letters (e.g., $w$) denote (column) vectors while
upper case letters (e.g., $A$) denote matrices. We will denote the
$(j, i)$-th element of $A$ by $A_{ji}$ and the $i$-th column of $A$ by
$A_i$. Subscripts with $t$, $t + 1$, etc. are used for indexing the
parameter vector with respect to time while superscripts are used for
indexing with respect to a processor. For instance, $w_t^i$ denotes the
parameter vector of the $i$-th processor at time $t$. We use $e_i$ to
denote the $i$-th basis vector (the vector of all zeros except a one in the
$i$-th position), and $e$ to denote the vector of all ones. Unless
specified otherwise, $\|\cdot\|$ refers to the Euclidean norm
$\|x\| := (\sum_i x_i^2)^{1/2}$, and $\langle \cdot, \cdot \rangle$ denotes
the Euclidean dot product $\langle x, x' \rangle = \sum_i x_i x'_i$.

Sequential Online Learning: Online learning usually proceeds in trials.
At each trial a data point $x_t$ is given to the learner, which produces a
parameter vector $w_t$ from a convex set $\Omega \subseteq \mathbb{R}^n$.
One then computes some function of the inner product
$\langle w_t, x_t \rangle$ in order to produce a label $\hat{y}_t$. The
true label $y_t$ is revealed to the learner, which then incurs a convex
(but not necessarily smooth) loss $l(w_t, x_t, y_t)$ and adjusts its
parameter vector. If we succinctly denote $f_t(w) := l(w, x_t, y_t)$, then
online learning is equivalent to solving the following optimization problem
in a stochastic fashion:

$$\min_{w \in \Omega} J(w), \quad \text{where } J(w) = \sum_{t=1}^{T} f_t(w) \text{ and } \Omega \subseteq \mathbb{R}^n, \qquad (1)$$

and the goal is to minimize the regret

$$\mathcal{R}_S = \sum_{t=1}^{T} f_t(w_t) - \min_{w \in \Omega} J(w). \qquad (2)$$

For many applications, however, the data are not all available to a
centralized learner to perform sequential online learning.
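
As an illustration of the regret (2), the following minimal sketch runs
online gradient descent on the one-dimensional losses
$f_t(w) = (w - z_t)^2$ with $z_t \in [0, 1]$. Everything here is an
assumption made for illustration and is not taken from this paper: the
loss, the data, the function names, and the step size $1/(Ht)$ with
$H = 2$, which is the classical choice for strongly convex losses
[Hazan et al. 2007]. With $\Omega = [0, 1]$ the iterates are running
averages of the targets and never leave $\Omega$, so no projection is
needed.

```python
import math
import random

def online_gradient_descent(targets, strong_convexity=2.0):
    """Run OGD on f_t(w) = (w - z_t)^2 with step 1/(H t); return w_1..w_T."""
    w, iterates = 0.0, []
    for t, z in enumerate(targets, start=1):
        iterates.append(w)              # w_t is chosen before z_t is revealed
        grad = 2.0 * (w - z)            # gradient of (w - z)^2
        w -= grad / (strong_convexity * t)
    return iterates

def regret(targets, iterates):
    """Regret (2): incurred loss minus the loss of the best fixed w."""
    incurred = sum((w - z) ** 2 for w, z in zip(iterates, targets))
    best = sum(targets) / len(targets)  # minimizer of sum_t (w - z_t)^2
    return incurred - sum((best - z) ** 2 for z in targets)

rng = random.Random(0)
targets = [rng.random() for _ in range(1000)]
iterates = online_gradient_descent(targets)
# Compare the regret with the classical bound (G^2 / 2H)(1 + log T), G = H = 2.
print(regret(targets, iterates), 1.0 + math.log(len(targets)))
```

The printed regret can be compared against the logarithmic bound in the
second value; the average regret per trial shrinks as $T$ grows.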

Communication via Doubly Stochastic Matrix: We shall see that our
autonomous learners exchange information with their neighbors. The
communication pattern is defined by a weighted directed graph with an
$m$-by-$m$ adjacency matrix $A$, which is doubly stochastic. Recall that a
matrix $A$ is said to be doubly stochastic if and only if all elements of
$A$ are non-negative and both rows and columns sum to one.

In the following analysis of regret bounds, we are interested in the
limiting behavior of $A^k$ as $k \to \infty$. It is well known in
finite-state Markov chain theory that there are geometric bounds for $A^k$
if $A$ is irreducible and aperiodic [Liu 2001]:

$$\left| A^k_{ij} - \frac{1}{m} \right| \le C \beta^k, \quad C > 0 \text{ and } 0 < \beta < 1, \qquad (3)$$



where $C$ and $\beta$ depend on the size and the topology of $G$. For
example, the famous spectral geometric bound has $C = \sqrt{m}$, $\beta$ =
the spectral gap of $A$. To this end, Duchi et al. [2010] examined the
impact of different choices of $A$ and network topologies on the
convergence rate of the dual averaging algorithm for distributed
optimization. Since the relationship between network topology and
convergence rate is not the focus of this paper, we use the bound given in
Chapter 12 of Liu [2001] for simplicity, where $C = 2$ and $\beta$ is
related to the minimum non-zero value of $A$. It is easy to show that our
regret bounds can be modified accordingly if one uses a general Markov
mixing bound.
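
The geometric decay of $A^k$ toward the uniform matrix can be checked
numerically. The sketch below assumes a hypothetical "lazy ring" of
$m = 5$ learners, in which each learner keeps weight 1/2 on itself and
gives weight 1/4 to each of its two ring neighbors; the matrix and all
helper names are illustrative choices, not taken from the paper.

```python
def ring_matrix(m):
    """Doubly stochastic matrix of a lazy ring (assumes m >= 3)."""
    A = [[0.0] * m for _ in range(m)]
    for i in range(m):
        A[i][i] = 0.5
        A[i][(i + 1) % m] = 0.25
        A[i][(i - 1) % m] = 0.25
    return A

def is_doubly_stochastic(A, tol=1e-12):
    """Non-negative entries, and every row and column sums to one."""
    m = len(A)
    nonneg = all(a >= 0 for row in A for a in row)
    rows = all(abs(sum(row) - 1.0) < tol for row in A)
    cols = all(abs(sum(A[i][j] for i in range(m)) - 1.0) < tol for j in range(m))
    return nonneg and rows and cols

def matmul(X, Y):
    """Plain matrix product of two square lists of lists."""
    m = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(m)) for j in range(m)]
            for i in range(m)]

def max_deviation(A, k):
    """max over (i, j) of |(A^k)_ij - 1/m|."""
    m = len(A)
    P = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for _ in range(k):
        P = matmul(P, A)
    return max(abs(P[i][j] - 1.0 / m) for i in range(m) for j in range(m))

A = ring_matrix(5)
for k in (1, 5, 20, 50):
    print(k, max_deviation(A, k))
```

The printed deviations shrink as $k$ grows, which is the behavior that a
bound of the form $C\beta^k$ captures.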



3 Distributed Autonomous Online Learning 

For distributed autonomous online learning, we assume that there are $m$
local online learners, each using only data stored at its local site. At
each trial $t$, $m$ data points $x_t^i$ with $i \in \{1, 2, \ldots, m\}$
are given and the $i$-th learner updates model parameters based on the
$i$-th point. The learner produces a parameter vector $w_t^i$ which is used
to compute the prediction $\langle w_t^i, x_t^i \rangle$ and the
corresponding loss $f_t^i(w) = l(w, x_t^i, y_t^i)$. The learners then
exchange information with a selected set of their neighbors before updating
$w_t^i$ to $w_{t+1}^i$. The communication pattern amongst processors is
assumed to form a strongly (but not necessarily fully) connected graph. In
particular, we will assume a directed weighted graph whose adjacency matrix
$A$ is doubly stochastic. One can interpret the entry $A_{ji}$ as the
importance that learner $i$ places on the parameter vector communicated by
learner $j$. Of course, if $A_{ji} = 0$ then learner $j$ does not send data
to learner $i$.
The corresponding optimization problem is

$$\min_{w \in \Omega} J(w) = \sum_{t=1}^{T} \sum_{i=1}^{m} f_t^i(w) \quad \text{and} \quad \Omega \subseteq \mathbb{R}^n, \qquad (4)$$

and regret is measured with respect to the parameter vector $w_t^j$ of an
arbitrary learner $j$:

$$\mathcal{R}_{DA} = \sum_{t=1}^{T} \sum_{i=1}^{m} f_t^i(w_t^j) - \min_{w \in \Omega} J(w). \qquad (5)$$
If we denote $f_t = \sum_{i=1}^{m} f_t^i$ (see footnote 1), our definition
of the regret has the same form as the regret in sequential online learning
for each local learner. Given $N$ data points, there are $T = N$ iterations
or trials in sequential online learning. In our case, this number reduces
to $T = N/m$.

Footnote 1: We abuse the notation $f_t$ hereinafter.

Algorithm 1 Distributed Autonomous Online Learning

1: Input: The number of learners $m$; initial points $w_1^1, \ldots, w_1^m$;
   doubly stochastic matrix $A = (A_{ji}) \in \mathbb{R}^{m \times m}$; and
   maximum iterations $T$.
2: for $t = 1, \ldots, T$ do
3:   for each learner $i = 1, \ldots, m$ do
4:     $g_t^i \leftarrow \partial_w f_t^i(w_t^i)$ (Local subgradient)
5:     Communicate $w_t^i$ with neighbors (as defined by $A$) and obtain
       their parameters.
6:     $\hat{w}_{t+1}^i \leftarrow \sum_j A_{ji} w_t^j - \eta_t g_t^i$ (Local subgradient descent)
7:     $w_{t+1}^i \leftarrow P_\Omega(\hat{w}_{t+1}^i) = \arg\min_{w \in \Omega} \|w - \hat{w}_{t+1}^i\|$ (Projection)
8:   end for
9: end for

We will show the convergence of $w_t^j$ by bounding the regret
$\mathcal{R}_{DA}$. In particular, we are interested in generalizing the
celebrated $\sqrt{T}$ and $\log T$ bounds [Zinkevich 2003; Hazan et al.
2007] of sequential online learning to distributed autonomous online
learning.

We present a general online learning algorithm for solving (4) here.
Specifically, a local learner propagates its parameter vector to other
learners. After receiving the parameters from other learners, each learner
forms a linear combination of the received parameters and its own old
parameter. The local learner then updates this combination using the
subgradient computed from its locally collected data. Via this cooperation,
the learners learn a model from distributed data sequentially. The
procedure is summarized in Algorithm 1.
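
One trial of the procedure described above can be sketched as follows.
This is a minimal illustration under stated assumptions, not the authors'
implementation: the hinge loss, the $\ell_2$ ball used for $\Omega$, the
step size $1/\sqrt{t}$, the four-learner lazy ring, and the hypothetical
classifier `w_true` are all choices made here for the example.

```python
import math
import random

def hinge_subgradient(w, x, y):
    """One subgradient of the hinge loss max(0, 1 - y<w, x>) with respect to w."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    return [-y * xi for xi in x] if margin < 1.0 else [0.0] * len(w)

def project_l2_ball(w, radius):
    """Euclidean projection onto {w : ||w|| <= radius}."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return list(w) if norm <= radius else [wi * radius / norm for wi in w]

def distributed_step(W, A, data, eta, radius):
    """One trial for all learners.

    W[i] is learner i's parameter vector, A[j][i] is the weight learner i
    places on learner j's parameters, and data[i] = (x, y) is learner i's point.
    """
    m, n = len(W), len(W[0])
    W_next = []
    for i in range(m):
        x, y = data[i]
        g = hinge_subgradient(W[i], x, y)                   # local subgradient
        mixed = [sum(A[j][i] * W[j][d] for j in range(m))   # combine neighbors
                 for d in range(n)]
        w_hat = [mixed[d] - eta * g[d] for d in range(n)]   # subgradient step
        W_next.append(project_l2_ball(w_hat, radius))       # projection
    return W_next

# Usage: four learners on a lazy ring, labels from a hypothetical classifier.
m, n, radius = 4, 2, 5.0
A = [[0.5, 0.25, 0.0, 0.25],
     [0.25, 0.5, 0.25, 0.0],
     [0.0, 0.25, 0.5, 0.25],
     [0.25, 0.0, 0.25, 0.5]]
rng = random.Random(0)
w_true = [1.0, -1.0]
W = [[0.0] * n for _ in range(m)]        # identical initialization
for t in range(1, 201):
    data = []
    for _ in range(m):
        x = [rng.uniform(-1, 1) for _ in range(n)]
        y = 1 if sum(a * b for a, b in zip(w_true, x)) >= 0 else -1
        data.append((x, y))
    W = distributed_step(W, A, data, eta=1.0 / math.sqrt(t), radius=radius)
print(W)
```

Note that each learner only ever touches its own data point; the only
quantities crossing the network are the parameter vectors `W[j]`.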



3.1 Regret Bounds 

For our analysis we make the following standard assumptions, which are
assumed to hold for all the proofs and theorems presented below. 1) Each
$f_t^i$ is strongly convex with modulus $\lambda \ge 0$ (see footnote 2).
2) $A_{ji} \ne 0$ if and only if the $i$-th learner communicates with the
$j$-th learner. We further assume $A$ is irreducible, aperiodic, and there
exists $\beta < 1$ as defined in (3). 3) $\Omega$ is a closed convex subset
of $\mathbb{R}^n$ with non-empty interior. The subgradient
$\partial_w f_t^i(w)$ can be computed for every $w \in \Omega$. 4) The
diameter $\mathrm{diam}(\Omega) = \sup_{x, x' \in \Omega} \|x - x'\|$ of
$\Omega$ is bounded by $F < \infty$. 5) The set of optimal solutions of
(4), denoted by $\Omega^*$, is non-empty. 6) The norm of the subgradients
of $f_t^i$ is bounded by $L$, and the $w_1^i$ are identically initialized.

The following theorem characterizes the regret of Algorithm 1. The proof
can be found in the appendix.

Footnote 2: Note that we allow for $\lambda = 0$, in which case $f_t^i$ is
just convex, but not strongly convex.
Theorem 1 If $\lambda > 0$ and we set $\eta_t = […]$, then

$$\sum_{t=1}^{T} f_t(w_t^j) - f_t(w^*) \le […] \, (1 + \log(T)). \qquad (6)$$

On the other hand, when $\lambda = 0$, if we set $\eta_t = […]$, then

$$\sum_{t=1}^{T} f_t(w_t^j) - f_t(w^*) \le m (F + 4CL^2) \sqrt{T}. \qquad (7)$$

$C = […]$ is a communication-graph-dependent constant.

When $m = 1$, Algorithm 1 reduces to classical sequential online learning.
Accordingly, our bounds (7) and (6) become the classical square root regret
$O(\sqrt{N})$ of Zinkevich [2003] and the logarithmic regret $O(\log T)$ of
Hazan et al. [2007]. When $m > 1$, recall that for every time $t$, the $m$
processors simultaneously process $m$ data points. Therefore in $T$ steps
our learners process $mT$ data points. If we let $N = mT$, then our bounds
can be rewritten as $O(\sqrt{mN})$ and $O(m + m \log(N/m))$, respectively.
It must be borne in mind that our algorithm is affected by two limiting
factors. First, there is only limited information sharing between different
learners. Second, by our definition of regret, our algorithm is forced to
predict on $m$ data points in one shot with a single parameter vector
$w_t^j$. This is in contrast with the sequential online learner, which has
access to the full data set and can use different parameter vectors for
each of the $m$ data points.
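
The rewritten rates follow from substituting $T = N/m$. As a short check
of the arithmetic only, treating $F$, $C$, $L$ and $\lambda$ as constants
and using the $m\sqrt{T}$ and $m(1 + \log T)$ dependence implied by the
rates stated above:

```latex
m\sqrt{T} \;=\; m\sqrt{\frac{N}{m}} \;=\; \sqrt{mN},
\qquad
m\,(1 + \log T) \;=\; m + m\log\frac{N}{m}.
```

For a fixed data budget $N$, both expressions grow with the number of
learners $m$, which reflects the two limiting factors discussed above.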

If we treat all the distributed parameters across the learners as a single
aggregated parameter $w_t = (w_t^1, \ldots, w_t^m)$, we can apply the
results for sequential online learning to obtain the generalization bounds
for distributed online learning in terms of the regret bounds. Due to space
limitations, we present the generalization bounds in the appendix.



4 Privacy and Topology of Communication Graphs 

A common form of $f_t^i(w)$ in the cost function (4) is
$l(y_t^i, \langle w, x_t^i \rangle)$. So the subgradient with respect to
$w_t^i$ is $g_t^i = \partial l(y_t^i, \langle w_t^i, x_t^i \rangle) \, x_t^i$,
which is proportional to $x_t^i$. Thus algorithms that transmit
subgradients (e.g., the first variant of Langford et al.'s algorithm
[Zinkevich et al. 2009]) may disclose sensitive information about raw data
(e.g., medical records), which is undesirable for the privacy-sensitive
applications mentioned above, such as mining patient information across
hospitals. Our decentralized algorithm transmits only local model
parameters between neighbors in the network, reducing the possibility of
information leakage.

Formally, the communication graph is a directed graph $\mathcal{G}(A)$.
The node set consists of the online learners $\{1, \ldots, m\}$. The edge
set $\mathcal{E}$ is $\{(i, j) \mid A_{ij} \ne 0\}$; that is, we say a node
$i$ is connected to node $j$ if and only if $(i, j) \in \mathcal{E}$, i.e.,
the weight $A_{ij}$ is nonzero. The neighbor set $N(j)$ of $j$ is
$\{i \mid (i, j) \in \mathcal{E}\}$. Intuitively the topology of the
communication graph can affect the privacy-preserving capability. Consider
the two examples in Figure 1 to gain intuition. We assume that all nodes
(learners) $M$, $P$ and $Q$ know the matrix $A$ representing the
communication graph, and the convex set $\Omega = \mathbb{R}^n$. Suppose
$M$ is a malicious node that wants to gain information about the input data
of $P$ and $Q$ by recovering their subgradients. Based on the communication
graph in Figure 1(a), $M$ receives the parameters from $P$ and $Q$. It can
use the received parameters to compute the linear combination and find the
subgradient. By contrast, it is intuitively difficult to recover the
subgradients based on the communication graph in Figure 1(b). Here $P$'s
parameters are "mixed" with $Q$'s parameters through a linear combination
at the local subgradient step (line 6 in Algorithm 1) before being sent to
$M$, and $M$ does not directly receive any information from $Q$. The
ambiguity about the parameters of $Q$ prevents the malicious node $M$ from
correctly reconstructing the local subgradients of $P$ and $Q$.

[Figure 1: two three-node networks over M, P and Q, panels (a) and (b).]

Figure 1: Illustrating the impact of network topology on privacy
preservation. In each of the three-node networks, M is a malicious node
(learner) that wants to gather the subgradients of P and Q. (a) M can
easily reconstruct the subgradients of P and Q by differencing successive
parameters received from P and Q. (b) M cannot reconstruct the
subgradients of P and Q. Intuitively this is because M does not receive
any information from Q and the parameters of P are "mixed" with Q's
parameters and subgradients.

4.1 Full Reconstruction 

Inspired by these two examples, we formally examine under which conditions
a malicious node cannot reconstruct all subgradients of other nodes based
on the parameter vectors of its adjacent nodes. We refer to this problem as
full reconstruction of subgradients, in contrast to the partial
reconstruction of subgradients discussed later. We assume
$\Omega = \mathbb{R}^n$ for the moment, i.e., there is no projection step
in Algorithm 1. Projection will be handled differently later. Throughout
this section, we shall use the following definitions and notation:

$$W_t = [w_t^1, \ldots, w_t^m], \qquad G_t = [g_t^1, \ldots, g_t^m].$$

We also assume that every learner (node) knows the whole communication
matrix $A$ and the initial parameter values $W_1$ of all other learners.
Without loss of generality, we may also assume the dimension of each
$w_t^i$ (thus $g_t^i$) is 1, since $W_t$ can be reconstructed row-by-row.

Now we formulate the problem of reconstructing all subgradients of the
other nodes based on the following linear time-invariant dynamic system
(see footnote 3):

$$S: \begin{cases} W_{t+1} = W_t A + \tilde{G}_t \\ Y_t = W_t C \end{cases} \qquad (8)$$

where $\tilde{G}_t = -\eta_t G_t$ is the (unknown) input (i.e., local
subgradients), $W_t$ is the state, $Y_t$ is the output (i.e., the columns
of $Y_t$ are parameter vectors received by $M$), and $C$ is a matrix
selecting the columns of $W_t$ that node $M$ receives. According to Brogan
[1991], the system $S$ is invertible if the output sequence $Y_t$
determines a unique input $\tilde{G}_t$. Therefore, we can rephrase the
full subgradient reconstruction problem as the invertibility of $S$. Our
theorem relates the invertibility of $S$ to the topological properties of
the communication graph.

Theorem 2 If all other nodes are connected to $M$, then for almost any
choice of nonzero entries in $A$, the output sequence $Y_t$ at the
malicious node $M$ gives rise to a unique sequence of subgradients
$\tilde{G}_t$. On the other hand, if not all other nodes are connected to
$M$, then regardless of the choice of nonzero entries in $A$, the output
sequence $Y_t$ does not uniquely specify $\tilde{G}_t$.

If all other nodes are connected to $M$, the malicious node can reconstruct
$\tilde{G}_t$ by duplicating the linear combination steps at the other
nodes and differencing the successive parameter vectors. This is exactly
what happens in Figure 1(a). The proof for the latter part of the theorem
relies on the analysis of the generic rank of structured systems [Sundaram
& Hadjicostis 2009; Dion et al. 2003], which relates the rank of the
transfer matrix $(zI - A)^{-1} C$, $z \in \mathbb{C}$, of $S$ to the
topological features defined by vertex-disjoint paths of the communication
graph. In the statement of the theorem, almost any means all choices of
entries in $A$ except a set of Lebesgue measure zero. These bad values
correspond to the roots of a polynomial function [Dion et al. 2003].
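
Both halves of the theorem can be illustrated on three scalar-parameter
nodes ordered (M, P, Q). The sketch below is an assumption-laden toy: the
two weight matrices, the input sequences and the helper names are
hypothetical, chosen so the arithmetic is exact. In case (a) every node
sends to M, and M recovers the inputs as
$\tilde{G}_t = W_{t+1} - W_t A$. In case (b), a chain in which M hears
only from P, two different input sequences produce identical observations
at M.

```python
def mix(W, A):
    """Row-vector convention: column i of W is learner i's parameters; returns W A."""
    m = len(A)
    return [[sum(row[j] * A[j][i] for j in range(m)) for i in range(m)] for row in W]

def simulate(W1, A, inputs):
    """Iterate W_{t+1} = W_t A + G~_t and return [W_1, ..., W_{T+1}]."""
    trajectory = [W1]
    for G in inputs:
        WA = mix(trajectory[-1], A)
        trajectory.append([[wa + g for wa, g in zip(r1, r2)]
                           for r1, r2 in zip(WA, G)])
    return trajectory

def reconstruct(trajectory, A):
    """If M observes every column, it recovers G~_t = W_{t+1} - W_t A."""
    recovered = []
    for W_t, W_next in zip(trajectory, trajectory[1:]):
        WA = mix(W_t, A)
        recovered.append([[wn - wa for wn, wa in zip(r1, r2)]
                          for r1, r2 in zip(W_next, WA)])
    return recovered

def observe(trajectory, columns):
    """Output Y_t = W_t C: the columns of W_t that M receives."""
    return [[[row[c] for c in columns] for row in W] for W in trajectory]

W1 = [[0.0, 0.0, 0.0]]   # known, identical initial parameters

# (a) Every node talks to M, so M sees all columns and inverts the system.
A_full = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25], [0.25, 0.25, 0.5]]
inputs = [[[0.0, 1.0, -2.0]], [[0.0, 3.0, 0.5]]]
recovered = reconstruct(simulate(W1, A_full, inputs), A_full)
print(recovered == inputs)

# (b) Chain Q - P - M: A_chain[2][0] == 0, so M receives nothing from Q.
A_chain = [[0.5, 0.5, 0.0], [0.5, 0.0, 0.5], [0.0, 0.5, 0.5]]
inputs_1 = [[[0.0, 1.0, 1.0]], [[0.0, 1.0, 1.0]]]
inputs_2 = [[[0.0, 1.0, 2.0]], [[0.0, 0.5, 0.5]]]
seen_1 = observe(simulate(W1, A_chain, inputs_1), columns=[0, 1])
seen_2 = observe(simulate(W1, A_chain, inputs_2), columns=[0, 1])
print(seen_1 == seen_2)
```

In case (b) the extra input at Q in the first step is cancelled by
adjusted inputs at P and Q in the second step, so the columns M observes
never change; this is the kind of ambiguity the theorem rules in.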



Footnote 3: Standard control notation is to treat the state of the system
as a column vector, so that systems are written as
$w_{t+1} = A w_t + G_t$, but the state vectors in this paper are written as
row vectors in order to maintain consistency with the rest of the paper.

4.2 Partial Reconstruction



Since reconstructing the subgradients of all other nodes is severely
constrained by the topology of the communication graph, the malicious node
may instead try to reconstruct the subgradients of only some of the nodes.
A logical step forward from the full reconstruction problem is therefore
partial reconstruction: given a set of nodes, what are the topological
requirements on the communication graph that allow a malicious node to
reconstruct the subgradients of this set of nodes?

Suppose a malicious node wants to reconstruct the subgradients of a set of
nodes $\mathcal{N}$. For the purpose of analysis, we break the input
$\tilde{G}_t$ of the system $S$ into two parts. One part,
$\tilde{G}_t^{\mathcal{N}}$, consists of the columns of $\tilde{G}_t$ that
correspond to the subgradients of the nodes in $\mathcal{N}$, and the other
part, $\tilde{G}_t^{\mathcal{U}}$, corresponds to all other nodes. The
dynamics of the algorithm can be described by the following system $S'$,
which is equivalent to the system $S$:

$$S': \begin{cases} W_{t+1} = W_t A + \tilde{G}_t^{\mathcal{N}} B_{\mathcal{N}} + \tilde{G}_t^{\mathcal{U}} B_{\mathcal{U}} \\ Y_t = W_t C \end{cases} \qquad (9)$$

$B_{\mathcal{N}}$ and $B_{\mathcal{U}}$ are suitable matrices that align
the inputs to the corresponding columns. Instead of considering the
invertibility of $S'$, we consider the partial invertibility of $S'$:
inverting only $\tilde{G}_t^{\mathcal{N}}$ from the output $Y_t$. The next
theorem relates the partial invertibility of $S'$ to the topological
properties of the communication graph.

Theorem 3 The necessary and sufficient conditions for the sequence of
output vectors $Y_t$ at the malicious node $M$ giving rise to a unique
sequence of $\tilde{G}_t^{\mathcal{N}}$ for almost any choice of nonzero
elements in $A$ are:

i) All nodes in $\mathcal{N}$ are connected to $M$.

ii) No other nodes are connected to the nodes in $\mathcal{N}$ but not
connected to $M$.

The proof of sufficiency is a simple corollary of Theorem 2. If the nodes
in $\mathcal{N}$ and $M$ satisfy the conditions in Theorem 3, the nodes of
$\mathcal{N} \cup \{M\}$ form a network that satisfies the full
reconstruction condition in Theorem 2, and $M$ can reconstruct the
subgradients of the nodes in $\mathcal{N}$ by duplicating the linear
combination and local subgradient steps at the nodes in $\mathcal{N}$.
Similar to the full reconstruction, the only exception for the partial
reconstruction is $\tilde{G}_1^{\mathcal{N}}$, whose recovery depends on
the knowledge of the initial parameters $W_1^{\mathcal{N}}$. The proof of
necessity is significantly harder than that of the full reconstruction, and
the long proof is given in the appendix. This theorem confirms our
intuition: for a set of nodes $\mathcal{N}$, if they directly provide
information to $M$ and there are no other nodes that "mix" unknown
information into this set of nodes, $M$ can reconstruct the subgradients of
the nodes in $\mathcal{N}$; otherwise the subgradients can only be
determined up to a linear subspace [Sundaram & Hadjicostis 2009].

The theory developed above can guide us in examining or designing
communication networks with privacy-preserving properties. We define a
privacy-preserving communication network as follows.

Definition 4 We say a communication network $\mathcal{G}(A)$ is
privacy-preserving if and only if the conditions in Theorem 3 do not hold
for any node $M$ and any set of nodes $\mathcal{N}$.

A set of nodes is called a vertex cut of a directed graph $G$ if the
removal of these nodes renders the graph disconnected. The connectivity
$\kappa(G)$ of the graph is the size of the smallest vertex cut. Suppose a
communication network is not privacy-preserving; then there exist a node
$M$ and a set of nodes $\mathcal{N}$ that satisfy the conditions in
Theorem 3. Furthermore, we assume that not all nodes are connected to $M$.
Then removing $M$ makes the graph disconnected because there is no path
from the nodes in $\mathcal{U}$ to the nodes in $\mathcal{N}$, so $\{M\}$
is a vertex cut and $\kappa(\mathcal{G}(A)) = 1$. The above analysis can be
summarized by the following theorem.

Theorem 5 For a communication network $\mathcal{G}(A)$, if
$\kappa(\mathcal{G}(A)) > 1$ and $\forall$ node $j$, $|N(j)| < m - 1$,
then $\mathcal{G}(A)$ is privacy-preserving.

It can be shown that many interesting networks, including those studied by
Duchi et al. [2010], are privacy-preserving. Examples include: (a) the
grid, where nodes are aligned on a 2-dimensional grid and connected to
their nearest 4 neighbors; (b) the k-dimensional hypercube, where nodes are
placed on the vertices of an imaginary k-dimensional hypercube and
connected to the neighboring vertices; (c) expander graphs, which can be
constructed to have large connectivity. These graphs have good mixing
properties.
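
The two conditions of Theorem 5 are easy to test on small networks. The
following sketch makes several assumptions that are not in the paper: $A$
is symmetric (so the graph can be treated as undirected), self-loops are
excluded from the neighbor set $N(j)$, there are at least three nodes, and
connectivity greater than one is checked by brute force (the graph must
stay connected after deleting any single node). All function names are
illustrative.

```python
def neighbors(A, j):
    """N(j) = {i : A[i][j] != 0}, with the self-loop excluded (an assumption)."""
    return {i for i in range(len(A)) if i != j and A[i][j] != 0}

def is_connected(adj, removed=frozenset()):
    """Depth-first search over the nodes that have not been removed."""
    nodes = [v for v in adj if v not in removed]
    if not nodes:
        return True
    seen, stack = {nodes[0]}, [nodes[0]]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in removed and v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == len(nodes)

def satisfies_theorem5(A):
    """Check connectivity > 1 and |N(j)| < m - 1 for every node j."""
    m = len(A)
    adj = {j: neighbors(A, j) for j in range(m)}
    connectivity_gt_1 = is_connected(adj) and all(
        is_connected(adj, {v}) for v in adj)
    small_neighborhoods = all(len(adj[j]) < m - 1 for j in adj)
    return connectivity_gt_1 and small_neighborhoods

def hypercube_matrix(k):
    """Symmetric doubly stochastic weights on the k-dimensional hypercube."""
    m, w = 2 ** k, 1.0 / (k + 1)
    A = [[0.0] * m for _ in range(m)]
    for i in range(m):
        A[i][i] = w
        for b in range(k):
            A[i][i ^ (1 << b)] = w   # flip one bit to reach a neighbor
    return A

clique = [[0.25] * 4 for _ in range(4)]
chain = [[0.5, 0.5, 0.0], [0.5, 0.0, 0.5], [0.0, 0.5, 0.5]]
print(satisfies_theorem5(hypercube_matrix(3)),
      satisfies_theorem5(clique),
      satisfies_theorem5(chain))
```

The hypercube passes both conditions, the clique fails the neighbor-count
condition, and the three-node chain fails the connectivity condition.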

4.3 Reconstruction under Projection 

We define auxiliary variables $r_{t+1}^i = w_{t+1}^i - \hat{w}_{t+1}^i$
(the difference between the projected and the unprojected parameters) and
define $R_t = [r_t^1, \ldots, r_t^m]$. Suppose again that the malicious
node is interested in the nodes in the set $\mathcal{N}$; the dynamics of
the distributed online learning algorithm with projection can be described
by the following system $S''$:

$$S'': \begin{cases} W_{t+1} = W_t A + (\tilde{G}_t^{\mathcal{N}} + R_{t+1}^{\mathcal{N}}) B_{\mathcal{N}} + (\tilde{G}_t^{\mathcal{U}} + R_{t+1}^{\mathcal{U}}) B_{\mathcal{U}} \\ Y_t = W_t C \end{cases} \qquad (10)$$

Note that reconstructing $\tilde{G}_t^{\mathcal{N}} + R_{t+1}^{\mathcal{N}}$
in system $S''$ is the same as reconstructing $\tilde{G}_t^{\mathcal{N}}$
in system $S'$, which has been addressed in Theorem 3. Therefore, in order
to reconstruct the subgradients $\tilde{G}_t^{\mathcal{N}}$ in system
$S''$, it is sufficient to reconstruct or separate the projection
difference $R_{t+1}^{\mathcal{N}}$ from $\tilde{G}_t^{\mathcal{N}}$.

Under the formulation of $S''$, we consider $-\eta_t g_t^i$ and
$r_{t+1}^i$ as two separate inputs to the node (learner) $i$, but each node
simply propagates the summation $-\eta_t g_t^i + r_{t+1}^i$. For certain
types of convex sets, such as hyper-balls or polytopes, it is easy to find
different data vectors having the same projection value, so it is hard to
separate $\tilde{G}_t^{\mathcal{N}}$ and $R_{t+1}^{\mathcal{N}}$. Formally,
we have the following theorem; the proof can be found in the appendix.
Theorem 6 In system $S''$, the output sequence $Y_t$ cannot determine a
unique sequence of subgradients for any communication network.

The proof of the above theorem follows a line similar to that of
Theorem 3, except that it uses different topological arguments. Theorem 6
should be applied with caution. It is possible to gain information about
the subgradients in the presence of a priori knowledge. For example, if
$\Omega$ is an $\ell_2$ ball, $\eta_t g_t^i$ and $r_{t+1}^i$ are
co-linear, so the summation $-\eta_t g_t^i + r_{t+1}^i$ can determine
$g_t^i$ up to a constant factor.

[Figure 2 panels; recovered panel titles: "Performance on RCV1 Data",
"The Effect of Topology on Convergence".]

Figure 2: (a) and (b): Convergence of distributed learning on synthetic and
real datasets. On both datasets, our distributed online learning algorithm
uses up to 256 nodes linked by hypercubes. It converges to the test error
rate of sequential online learning. (c) Convergence of distributed learning
with different communication graphs consisting of 256 nodes on synthetic
data. When the communication graphs are grids or hypercubes, the algorithm
converges slightly slower than when the communication graphs are cliques.
But unlike cliques, grids and hypercubes prevent malicious nodes from
reconstructing subgradients of other nodes.

5 Related Works



Recently some research effort has been devoted to devising distributed
online learning. For instance, Zinkevich et al. [2009] show that one can
distribute the data on slave nodes. The slaves periodically poll the
centralized master node to receive the latest parameter vector. This is
used to compute stochastic gradients, which are then fed back to the master
node at the expense of using delayed subgradients. Their bounds have the
form $O(\sqrt{\tau N})$ and $O(\tau + \tau \log(N/m))$, where $\tau$ is the
delay in the subgradient calculation. Given the fact that $\tau$ is as
large as $m$ in a round-robin communication scheme, the bounds of Zinkevich
et al. [2009] are similar to ours.



The decentralized learning paradigm was pioneered in distributed
optimization. For example, Duchi et al. [2010] proposed a dual averaging
algorithm for distributed convex optimization. They provided sharp bounds
on its convergence rate as a function of the network size and topology by
careful mixing time arguments. Zinkevich et al. [2010] proposed to perform
local stochastic gradient descent individually and then give the output as
the average of local parameters at the final step. However, their fixed
step size assumption does not guarantee that the algorithm converges to the
true optimum. In terms of algorithmic structure and underlying mathematical
foundations, our algorithm is a natural extension of the works of Nedic &
Ozdaglar [2009] and Ram et al. [2010] for distributed convex optimization
to online learning, but our analysis handles strongly convex functions and
yields $O(\log T)$ regret. If our regret bounds are converted to
convergence rates, then we obtain not only $O(1/\epsilon^2)$ rates for
convex functions, but also $O(1/\epsilon)$ rates for strongly convex
functions, which are not covered by Nedic & Ozdaglar [2009] and Ram et al.
[2010]. Except for the work of Zinkevich et al. [2010], which is obviously
privacy-preserving due to the lack of communication, none of these works
considered the privacy-preserving aspect of the algorithms.

Privacy preservation has been an active research area in machine learning
and data mining. Most privacy-preserving machine learning algorithms modify
the original algorithms with cryptographic tools to achieve privacy
preservation. Two popular techniques are secure multi-party computation
(SMC) and randomization. For example, the privacy-preserving versions of
linear regression [Vaidya et al. 2005], belief propagation/Gibbs sampling
[Kearns et al. 2007] and online prediction over discrete values [Sakuma &
Arai 2010] use SMC to securely compute function values over distributed
data without disclosing them to unwanted identities; the privacy-preserving
logistic regression [Chaudhuri & Monteleoni 2009] uses randomized
perturbation to modify the cost function to preserve data privacy. Many
algorithms, such as association rule mining and decision trees, can use
either SMC or randomization to achieve privacy preservation [Vaidya et al.
2005]. Compared to the algorithms using SMC and randomization, our analysis
of privacy does not require any modification of the original algorithm. Our
privacy-preserving properties are intrinsic in the sense that they rely
only on a component of our algorithm, the communication graph, to prevent
disclosure of local subgradients (hence data) to other nodes.

By treating the local parameter $w_t^i$ as an aggregated vector of local
subgradients (data), our approach to privacy preservation is closely
related to the aggregation-based methods on a conceptual level. For
example, Rüping [2010] trains support vector machines by using group
probabilities over subsets of data. Avidan & Butman [2007] proposed a
boosting-based privacy-preserving face detection algorithm by restricting
the learner to use limited features provided by the data feeder. One
drawback of these algorithms is that they sacrifice algorithm performance
for data privacy by only revealing aggregated or limited information. By
contrast, our algorithm achieves the same asymptotic convergence rate as
the sequential algorithm on a fixed number of learners.



6 Simulations 

We conduct two sets of simulations to illustrate how quickly the
generalization error of our distributed learning algorithm converges given
a certain number of nodes and to examine the impact of the topology of
communication graphs on the convergence rate. For our implementations, each
$f_t^i(w)$ has the form $h(y_t^i \langle w, x_t^i \rangle)$, where
$\{(x_t^i, y_t^i) \in \mathbb{R}^n \times \{\pm 1\}\}$ are the training
data available only to the $i$-th node, and $h(x)$ is the hinge loss
function $h(x) = \max\{1 - x, 0\}$. For robustness, we set the learning
rate $\eta_t = […]$.

First, we investigate how the number of nodes affects the predictive
performance of our algorithm on both synthetic and RCV1 datasets (see
footnote 4). The synthetic data are generated uniformly from a
10-dimensional unit ball. The classifier is randomly sampled and less than
10% of the labels based on the true classifier are flipped to the wrong
labels. In total, we generate 1,000,000 training and 500,000 test examples.
The second dataset is a subset of the RCV1 dataset. This subset contains
100,000 training examples, 100,000 test examples, and 47,236 features with
many zero entries for each sample. Figures 2(a) and (b) summarize the
results. In line with the theoretical regret guarantee for our distributed
algorithm, the test error of our algorithm, even with 256 nodes, indeed
converges to that of the sequential learner on both datasets.

For the second experiment, we construct three types of communication
graphs consisting of 256 nodes: i) grid, where nodes are laid out and connected
on a 2-D mesh grid; ii) hypercube, where nodes are laid out and connected on an
8-dimensional hypercube; iii) clique, where the nodes form a clique. As shown
in Figure 2(c), the clique topology leads to slightly faster convergence than grid
and hypercube, but it discloses subgradients in the presence of malicious nodes
according to Theorem 2.
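To give a feel for why topology affects convergence, the sketch below builds the three 256-node graphs and measures how far repeated mixing of a unit vector is from the uniform average after a fixed number of rounds. The Metropolis weighting used to obtain a doubly stochastic matrix is an assumption chosen for illustration (the weights used in the experiments are not stated here), and all function names are hypothetical.

```python
def grid_neighbors(side):
    """Neighbor sets for a side x side 2-D mesh grid (no wrap-around)."""
    nbrs = {r * side + c: set() for r in range(side) for c in range(side)}
    for r in range(side):
        for c in range(side):
            v = r * side + c
            if c + 1 < side:
                nbrs[v].add(v + 1)
                nbrs[v + 1].add(v)
            if r + 1 < side:
                nbrs[v].add(v + side)
                nbrs[v + side].add(v)
    return nbrs


def hypercube_neighbors(dim):
    """Neighbor sets for a dim-dimensional hypercube (flip one bit)."""
    return {v: {v ^ (1 << b) for b in range(dim)} for v in range(2 ** dim)}


def clique_neighbors(m):
    """Neighbor sets for a clique on m nodes."""
    return {v: set(range(m)) - {v} for v in range(m)}


def metropolis_weights(nbrs):
    """Symmetric, doubly stochastic weights (an assumed choice of A)."""
    weights = {}
    for v, ns in nbrs.items():
        row = {u: 1.0 / (1.0 + max(len(ns), len(nbrs[u]))) for u in ns}
        row[v] = 1.0 - sum(row.values())
        weights[v] = row
    return weights


def disagreement_after(nbrs, rounds):
    """Largest deviation from the average after mixing a unit vector."""
    weights = metropolis_weights(nbrs)
    m = len(nbrs)
    x = [1.0] + [0.0] * (m - 1)
    for _ in range(rounds):
        x = [sum(wt * x[u] for u, wt in weights[v].items()) for v in range(m)]
    return max(abs(xv - 1.0 / m) for xv in x)
```

Calling `disagreement_after` with `grid_neighbors(16)`, `hypercube_neighbors(8)` and `clique_neighbors(256)` for the same number of rounds orders the three topologies by how quickly information spreads, with the clique mixing fastest.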



7 Discussion 

We have only analyzed the case where the communication matrix $A$ is fixed
and does not evolve over time. Our proofs can be extended to the settings of
asynchronous update or random communication as studied by Nedić & Ozdaglar
2009. The resulting linear systems are time-varying, which makes them much harder to
analyze. However, we conjecture that all the privacy-preserving properties still
hold if the transient network connectivity is greater than one at every update
step.

¹ http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.

A Proofs of the Regret Bounds 

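The subgradient (set) $\partial_x f(\cdot)$ of a convex function $f(x)$ at $x_0$ is defined as

$$g \in \partial_x f(x_0) \iff \forall y,\; f(y) - f(x_0) \ge \langle y - x_0, g \rangle. \qquad (11)$$

A convex function $f(\cdot)$ defined on domain $\Omega$ is said to be strongly convex with
modulus $\lambda > 0$ if and only if

$$\forall x, y \in \Omega,\quad f(y) - f(x) - \langle y - x, \partial_x f(x) \rangle \ge \lambda \|y - x\|^2, \qquad (12)$$

where $\partial_x f(x)$ is the subgradient. The Euclidean projection operator onto a set
$\Omega \subseteq \mathbb{R}^n$ is defined as

$$P_\Omega(w') = \operatorname*{argmin}_{w \in \Omega} \|w - w'\|. \qquad (13)$$

We define the average parameter vector $\bar{w}_t$ as

$$\bar{w}_t = \frac{1}{m} \sum_{i=1}^{m} w_t^i. \qquad (14)$$

Our proof is based on an analysis of the sequence of values $\bar{w}_t$.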
A.1 Lemmas

We start from a key result concerning the decomposition of the regret, Lemma 7,
given below.

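Lemma 7 Let $w_t^i$ denote the sequences generated by Algorithm 1. Denote $\bar{g}_t^i =
\partial_w f_t^i(\bar{w}_t)$. For any $w \in \Omega$ we have

$$\begin{aligned}
\|\bar{w}_{t+1} - w\|^2 \le{}& (1 - 2\eta_t\lambda) \|\bar{w}_t - w\|^2 + \frac{4\eta_t^2}{m^2} \Big( \sum_{i=1}^{m} \|g_t^i\| \Big)^2 - \frac{2\eta_t}{m} \big( f_t(\bar{w}_t) - f_t(w) \big) \\
&+ \frac{2\eta_t}{m} \sum_{i=1}^{m} \big( \|g_t^i\| + \|\bar{g}_t^i\| \big) \|\bar{w}_t - w_t^i\| + \frac{2\eta_t}{m} \sum_{i=1}^{m} \|g_t^i\| \, \|\bar{w}_t - w_{t+1}^i\|. \qquad (15)
\end{aligned}$$
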
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Proof Define 



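$$r_{t+1}^i = w_{t+1}^i - \hat{w}_{t+1}^i = P_\Omega(\hat{w}_{t+1}^i) - \hat{w}_{t+1}^i. \qquad (16)$$

Recall that $\Omega$ is assumed to be convex, $A$ is a doubly stochastic matrix, and
$w_t^j \in \Omega$ for all $j$. Therefore, $A_{ji} \ge 0$, $\sum_j A_{ji} = 1$, and $\sum_j A_{ji} w_t^j \in \Omega$ for all
$i$. By this observation, the definition of the projection operator (13), and the
definition of $\hat{w}_{t+1}^i$ in Line 6 of Algorithm 1, we have the following estimate for
the norm of $r_{t+1}^i$:

$$\|r_{t+1}^i\| \le \Big\| \sum_j A_{ji} w_t^j - \hat{w}_{t+1}^i \Big\| = \eta_t \|g_t^i\|. \qquad (17)$$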



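Then, we define the following matrices to simplify the notation:

$$W_t = [w_t^1, \cdots, w_t^m], \qquad G_t = [g_t^1, \cdots, g_t^m], \qquad R_t = [r_t^1, \cdots, r_t^m].$$

Since $A$ is doubly stochastic, $Ae = e$, where $e$ is the all-ones vector. Therefore, by using (16) and the update
in step 6 of Algorithm 1, we have the relation

$$\begin{aligned}
\bar{w}_{t+1} = \frac{1}{m} W_{t+1} e &= \frac{1}{m} (W_t A - \eta_t G_t + R_{t+1}) e \\
&= \frac{1}{m} W_t e - \frac{\eta_t}{m} G_t e + \frac{1}{m} R_{t+1} e \\
&= \bar{w}_t - \frac{\eta_t}{m} \sum_{i=1}^{m} g_t^i + \frac{1}{m} \sum_{i=1}^{m} r_{t+1}^i. \qquad (18)
\end{aligned}$$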
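The recursion (18) for the average parameter can be checked numerically. The sketch below draws random parameters and subgradients for three nodes, applies one mixing, subgradient and projection step, and compares the new average with the right-hand side of (18). The specific matrix, the unit-ball feasible set, the step size, and the function names are assumptions made only for this check.

```python
import random


def project_onto_ball(w, radius=1.0):
    """Euclidean projection onto {w : ||w|| <= radius} (assumed feasible set)."""
    norm = sum(wi * wi for wi in w) ** 0.5
    if norm <= radius:
        return list(w)
    return [wi * radius / norm for wi in w]


def average_recursion_gap(seed=0):
    """Return the max coordinate gap between both sides of the recursion."""
    rng = random.Random(seed)
    # A small symmetric, doubly stochastic matrix (illustrative choice).
    A = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25], [0.25, 0.25, 0.5]]
    m, dim, eta = 3, 2, 0.3
    W = [project_onto_ball([rng.uniform(-1, 1) for _ in range(dim)]) for _ in range(m)]
    G = [[rng.uniform(-2, 2) for _ in range(dim)] for _ in range(m)]
    # Unprojected update for node i: sum_j A[j][i] * w_j - eta * g_i.
    W_hat = [
        [sum(A[j][i] * W[j][d] for j in range(m)) - eta * G[i][d] for d in range(dim)]
        for i in range(m)
    ]
    W_next = [project_onto_ball(w) for w in W_hat]
    # Projection residual r_i = w_next_i - w_hat_i.
    R = [[W_next[i][d] - W_hat[i][d] for d in range(dim)] for i in range(m)]

    def avg(V):
        return [sum(v[d] for v in V) / m for d in range(dim)]

    lhs = avg(W_next)
    rhs = [avg(W)[d] - eta * avg(G)[d] + avg(R)[d] for d in range(dim)]
    return max(abs(a - b) for a, b in zip(lhs, rhs))
```

The gap is zero up to floating-point error because the rows of a doubly stochastic matrix sum to one, so mixing leaves the average unchanged.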



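Using the above relation we unroll $\|\bar{w}_{t+1} - w\|^2$ by

$$\|\bar{w}_{t+1} - w\|^2 = \|\bar{w}_t - w\|^2 + \Big\| \frac{1}{m} \sum_{i=1}^{m} (r_{t+1}^i - \eta_t g_t^i) \Big\|^2 + \frac{2}{m} \sum_{i=1}^{m} \langle r_{t+1}^i - \eta_t g_t^i, \bar{w}_t - w \rangle. \qquad (19)$$

In view of (17),

$$\Big\| \frac{1}{m} \sum_{i=1}^{m} (r_{t+1}^i - \eta_t g_t^i) \Big\|^2 \le \frac{1}{m^2} \Big( \sum_{i=1}^{m} \big( \|r_{t+1}^i\| + \eta_t \|g_t^i\| \big) \Big)^2 \le \frac{4\eta_t^2}{m^2} \Big( \sum_{i=1}^{m} \|g_t^i\| \Big)^2. \qquad (20)$$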
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Next we turn our attention to the 

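$-\langle g_t^i, \bar{w}_t - w \rangle$ term, which we bound using (11) and (12) as follows:

$$\begin{aligned}
-\langle g_t^i, \bar{w}_t - w \rangle &= -\langle g_t^i, \bar{w}_t - w_t^i \rangle - \langle g_t^i, w_t^i - w \rangle \\
&\le \|g_t^i\| \, \|\bar{w}_t - w_t^i\| + f_t^i(w) - f_t^i(w_t^i) - \lambda \|w_t^i - w\|^2 \\
&= \|g_t^i\| \, \|\bar{w}_t - w_t^i\| + f_t^i(\bar{w}_t) - f_t^i(w_t^i) - \lambda \|w_t^i - w\|^2 + f_t^i(w) - f_t^i(\bar{w}_t) \\
&\le \|g_t^i\| \, \|\bar{w}_t - w_t^i\| + \langle \bar{g}_t^i, \bar{w}_t - w_t^i \rangle - \lambda \|w_t^i - \bar{w}_t\|^2 - \lambda \|w_t^i - w\|^2 + f_t^i(w) - f_t^i(\bar{w}_t) \\
&\le \big( \|g_t^i\| + \|\bar{g}_t^i\| \big) \|\bar{w}_t - w_t^i\| - \lambda \|\bar{w}_t - w\|^2 + f_t^i(w) - f_t^i(\bar{w}_t).
\end{aligned}$$

The last inequality is by using

$$\langle \bar{g}_t^i, \bar{w}_t - w_t^i \rangle \le \|\bar{g}_t^i\| \, \|\bar{w}_t - w_t^i\|$$

and

$$\|w_t^i - \bar{w}_t\| + \|w_t^i - w\| \ge \|\bar{w}_t - w\|.$$

Summing up over $i = 1, \ldots, m$, we obtain

$$-\sum_{i=1}^{m} \langle g_t^i, \bar{w}_t - w \rangle \le \sum_{i=1}^{m} \big( \|g_t^i\| + \|\bar{g}_t^i\| \big) \|\bar{w}_t - w_t^i\| - \lambda m \|\bar{w}_t - w\|^2 - \big( f_t(\bar{w}_t) - f_t(w) \big). \qquad (21)$$

The projection operator satisfies the following property:

$$\langle P_\Omega(\hat{w}) - \hat{w}, \hat{w} - w \rangle \le -\|P_\Omega(\hat{w}) - \hat{w}\|^2 \le 0, \quad \forall w \in \Omega. \qquad (22)$$

In order to estimate $\langle r_{t+1}^i, \bar{w}_t - w \rangle$, we use (16), (24), and (17) to write

$$\begin{aligned}
\langle r_{t+1}^i, \bar{w}_t - w \rangle &= \langle r_{t+1}^i, \bar{w}_t - \hat{w}_{t+1}^i \rangle + \langle P_\Omega(\hat{w}_{t+1}^i) - \hat{w}_{t+1}^i, \hat{w}_{t+1}^i - w \rangle \\
&\le \eta_t \|g_t^i\| \, \|\bar{w}_t - w_{t+1}^i\|. \qquad (23)
\end{aligned}$$

Combining (20), (21) and (23) with (19) completes the proof.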



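The projection operator satisfies the following property:

$$\langle P_\Omega(\hat{w}) - \hat{w}, \hat{w} - w \rangle \le -\|P_\Omega(\hat{w}) - \hat{w}\|^2 \le 0, \quad \forall w \in \Omega. \qquad (24)$$

The following lemma upper bounds the terms $\|\bar{w}_t - w_t^i\|$ and $\|\bar{w}_t - w_{t+1}^i\|$
in (15). The convergence rate of the powers of the communication matrix plays a central role in this lemma.

Lemma 8 If the assumptions stated in the main text hold, and $\beta$ is the constant in the convergence rate of $A^k$, then

$$\|\bar{w}_t - w_t^i\| \le 4L \sum_{k=1}^{t-1} \eta_{t-k} \beta^{k-1}, \qquad (25)$$

$$\|\bar{w}_t - w_{t+1}^i\| \le 4L \sum_{k=0}^{t-1} \eta_{t-k} \beta^{k}. \qquad (26)$$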

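Proof Using the notation defined in the proof of Lemma 7, we unroll the relation

$$W_t = W_{t-1} A - \eta_{t-1} G_{t-1} + R_t, \qquad (27)$$

which is defined through Algorithm 1, to obtain

$$W_t = W_1 A^{t-1} - \sum_{k=1}^{t-1} \eta_{t-k} G_{t-k} A^{k-1} + \sum_{k=1}^{t-1} R_{t-k+1} A^{k-1}.$$

Using $A^k e = e$ for all $k$, the convergence rate of $A^k$, (17), and the above relation, we can write

$$\begin{aligned}
\|\bar{w}_t - w_t^i\| &= \Big\| W_t \Big( \frac{1}{m} e - e_i \Big) \Big\| \\
&\le \|\bar{w}_1 - w_1^i\| + \sum_{k=1}^{t-1} \eta_{t-k} \Big\| G_{t-k} \Big( \frac{1}{m} e - A^{k-1} e_i \Big) \Big\| + \sum_{k=1}^{t-1} \Big\| R_{t-k+1} \Big( \frac{1}{m} e - A^{k-1} e_i \Big) \Big\| \\
&\le 4L \sum_{k=1}^{t-1} \eta_{t-k} \beta^{k-1}. \qquad (28)
\end{aligned}$$

We omit the proof for (26), which follows along similar lines.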
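The geometric factor $\beta^{k-1}$ in this bound reflects how quickly the columns of $A^k$ approach the uniform vector $\frac{1}{m} e$. The sketch below illustrates this for a lazy random walk on a ring, a hypothetical example matrix whose second-largest eigenvalue is known in closed form; using that eigenvalue as a stand-in for the paper's $\beta$ is an assumption, as are the function names.

```python
import math


def lazy_ring_matrix(m):
    """Symmetric, doubly stochastic matrix of a lazy walk on an m-node ring."""
    A = [[0.0] * m for _ in range(m)]
    for i in range(m):
        A[i][i] = 0.5
        A[i][(i + 1) % m] += 0.25
        A[i][(i - 1) % m] += 0.25
    return A


def mixing_errors(A, steps, i=0):
    """Euclidean distance between A^k e_i and the uniform vector, k = 1..steps."""
    m = len(A)
    col = [1.0 if j == i else 0.0 for j in range(m)]
    errors = []
    for _ in range(steps):
        col = [sum(A[r][c] * col[c] for c in range(m)) for r in range(m)]
        errors.append(math.sqrt(sum((v - 1.0 / m) ** 2 for v in col)))
    return errors


def ring_second_eigenvalue(m):
    """Second-largest eigenvalue of the lazy ring walk: 1/2 + cos(2*pi/m)/2."""
    return 0.5 + 0.5 * math.cos(2.0 * math.pi / m)
```

For this matrix the error after $k$ steps is at most the second-largest eigenvalue raised to the power $k$, which is the kind of decay the sums in (25) and (26) rely on.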



A general lemma on the regret bounds is the following.

Lemma 9 Let $w^* \in \Omega$ denote the best parameter chosen in hindsight. Then
the regret of Algorithm 1 can be bounded via

$$\sum_{t=1}^{T} f_t(\bar{w}_t) - f_t(w^*) \le mF \Big( \frac{1}{2\eta_T} - T\lambda \Big) + C L^2 m \sum_{t=1}^{T} \eta_t, \qquad (29)$$

where $C$ is a communication-graph-dependent constant defined as

$$C = \frac{5 - \beta}{1 - \beta}. \qquad (30)$$



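Proof Set $w = w^*$, divide both sides of (15) by $\frac{2\eta_t}{m}$, and rearrange to obtain

$$\begin{aligned}
f_t(\bar{w}_t) - f_t(w^*) \le{}& \frac{m}{2\eta_t} \Big[ (1 - 2\eta_t\lambda) \|\bar{w}_t - w^*\|^2 - \|\bar{w}_{t+1} - w^*\|^2 \Big] + \frac{2\eta_t}{m} \Big( \sum_{i=1}^{m} \|g_t^i\| \Big)^2 \\
&+ \sum_{i=1}^{m} \big( \|g_t^i\| + \|\bar{g}_t^i\| \big) \|\bar{w}_t - w_t^i\| + \sum_{i=1}^{m} \|g_t^i\| \, \|\bar{w}_t - w_{t+1}^i\|.
\end{aligned}$$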



Plug in the estimate of the subgradients and the bounds (25) and (26).



Mwt) - Mw*) 

<^[{l-2rjtX)\\wt-w*\\-\\wt+^-w*\\] 

t-l i-l 



fe=i 



fe=0 



m 



<—[(!- 2r^tX) \\wt -w*\\- \\wt+i - w* 
2f7t 



fc=i 



Summing over $t = 1, \ldots, T$,



Y^f,{w-t)-Mw*) 
t=i 

t=i ^* 



w II - ||wt+i - w 



T t-1 



t=i 



t=i fe=i 



= C2 
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Since the diameter of is bounded by F 

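$$C_1 \le mF \Big( \frac{1}{2\eta_T} - T\lambda \Big).$$

Let $I(t > k)$ be the indicator function, which is 1 when $t > k$ and 0 otherwise.
Then

$$\begin{aligned}
\sum_{t=1}^{T} \sum_{k=1}^{t-1} \eta_{t-k} \beta^{k-1} &= \sum_{t=1}^{T} \sum_{k=1}^{T} I(t > k) \, \eta_{t-k} \beta^{k-1} = \sum_{k=1}^{T} \beta^{k-1} \sum_{t=k+1}^{T} \eta_{t-k} \\
&\le \sum_{k=1}^{T} \beta^{k-1} \sum_{t=1}^{T} \eta_t \le \frac{1}{1 - \beta} \sum_{t=1}^{T} \eta_t.
\end{aligned}$$

Plug in the estimates for $C_1$ and $C_2$ to obtain (29).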
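The exchange of summation order used above can be sanity-checked numerically. The short sketch below evaluates the double sum $\sum_{t}\sum_{k<t} \eta_{t-k}\beta^{k-1}$ directly and compares it with $\frac{1}{1-\beta}\sum_t \eta_t$; the step-size sequence and the value of $\beta$ in the usage are arbitrary illustrative choices, and the function names are hypothetical.

```python
def double_sum(eta, beta):
    """Sum over t = 1..T and k = 1..t-1 of eta_{t-k} * beta^(k-1).

    eta is a list with eta[t - 1] holding the step size eta_t.
    """
    T = len(eta)
    return sum(
        eta[t - k - 1] * beta ** (k - 1)
        for t in range(1, T + 1)
        for k in range(1, t)
    )


def geometric_bound(eta, beta):
    """The upper bound (1 / (1 - beta)) * sum of all step sizes."""
    return sum(eta) / (1.0 - beta)
```

For example, with `eta = [1 / (2 * t ** 0.5) for t in range(1, 201)]` and `beta = 0.9`, `double_sum(eta, beta)` stays below `geometric_bound(eta, beta)`.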



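A.2 Proof of Theorem 1

First consider $\lambda > 0$ with $\eta_t = \frac{1}{2\lambda t}$. In this case $\frac{1}{2\eta_T} = T\lambda$, and consequently
(29) in Lemma 9 specializes to

$$\sum_{t=1}^{T} f_t(\bar{w}_t) - f_t(w^*) \le \frac{C L^2 m}{2\lambda} \sum_{t=1}^{T} \frac{1}{t} \le \frac{C L^2 m}{2\lambda} (1 + \log(T)).$$

When $\lambda = 0$, we set $\eta_t = \frac{1}{2\sqrt{t}}$ and rewrite (29) as

$$\sum_{t=1}^{T} f_t(\bar{w}_t) - f_t(w^*) \le mF\sqrt{T} + C L^2 m \sum_{t=1}^{T} \frac{1}{2\sqrt{t}} \le mF\sqrt{T} + C L^2 m \sqrt{T}.$$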
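The two elementary sum estimates behind these bounds, $\sum_{t=1}^{T} 1/t \le 1 + \log T$ and $\sum_{t=1}^{T} 1/(2\sqrt{t}) \le \sqrt{T}$, are easy to confirm numerically. The helper names below are hypothetical and the sketch only evaluates the sums; it does not reproduce the regret itself.

```python
import math


def harmonic_sum(T):
    """Sum of 1/t for t = 1..T, which arises from eta_t = 1 / (2 * lambda * t)."""
    return sum(1.0 / t for t in range(1, T + 1))


def inverse_sqrt_sum(T):
    """Sum of 1/(2*sqrt(t)) for t = 1..T, which arises from eta_t = 1 / (2 * sqrt(t))."""
    return sum(1.0 / (2.0 * math.sqrt(t)) for t in range(1, T + 1))
```

Both sums grow sublinearly in $T$, which is what yields logarithmic regret in the strongly convex case and square-root regret in the convex case.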



B Generalization Bound 

We investigate the relationship between the regret bounds and the generalization
ability of the proposed algorithms. Let $\mathcal{F}$ be the space of all possible choices
of $f_t^i(w)$ equipped with a probability measure. Random variables are denoted
as capital letters, e.g. $f_t^i(w)$ is a realization of the random variable $F_t^i(w) \in \mathcal{F}$.
We assume the functions $f_t^i(w)$ are generated as i.i.d. random elements in
$\mathcal{F}$ according to the unknown distribution over $\mathcal{F}$. The risk of $w$ is defined
as $\mathrm{rk}(w) = E[F(w)]$. A common form of $f_t^i(w)$ in the cost function is
$l(y_t^i, \langle w, x_t^i \rangle)$, where $l(\cdot, \cdot)$ is the loss function. In this case, the risk is the expected
loss when the parameter is $w$. Since the data $x_t^i$ are bounded in most cases, we
can assume the loss $l(\cdot, \cdot)$ or the functions $f_t^i$ are bounded. Let $N$ denote the
number of all functions $f_t^i$ up to the iteration $T$, so that $N = mT$. The following
theorem bounds the risk by the regret $\mathcal{R}_{DA}$.

Theorem 10 //V / G J", |/| < \, then for any < (5 < 1; with at least 1 — 5 
probability, we have 



inf rk(T4^/) — minrk(w) < — 
t=i,...,T ^ * ' wen ^ ' N 



36^ TZda + 3 2 / TZda + S 

— hi \ rCnA til 

N 5 n\ 5 



(31) 



are random. The inequality (31) gives an
$O(1/N)$ bound on the risk of the best aggregated parameter for strongly convex
functions, which translates to an $O(1/\epsilon)$ convergence rate (in probability). The
key to the proof of the theorem is the generalization bound for sequential online
learning by Cesa-Bianchi & Gentile 2006, which is based on Bernstein's
martingale inequality.
Proof Let $W_t = (w_t^1, \ldots, w_t^m)$ be the parameter vector at the iteration $t$. Since
$\bar{w}_t$ can be represented by a function of $W_t$ through the average (14), we can define
$f_t(W_t) = \sum_{i=1}^{m} f_t^i(\bar{w}_t)$. The aggregated risk is defined as

$$\mathrm{rk}(W_t) = E[F_t(W_t)]. \qquad (32)$$

Since the $F_t^i$ are i.i.d., we have $\mathrm{rk}(W_t) = \sum_{i=1}^{m} E[F_t^i(\bar{w}_t)] = m \cdot \mathrm{rk}(\bar{w}_t)$.

In terms of $f_t$ and $W_t$, Algorithm 1 can be regarded as a sequential online
learning algorithm that updates $W_t$ with $f_t$. This view of the algorithm falls
into the general setting of online learning algorithms studied in Cesa-Bianchi
& Gentile 2006 if we further interpret $W_t$ as hypotheses and $f_t$ as training
examples. Proposition 2 of Cesa-Bianchi & Gentile 2006 gives


(jrpMWt^ < 



n 



DA 



T 



T 



miny^y^ K'fw) 



t=l i=\ 



36 TIba 

— in ^ 

T 8 



+ 2 



T^da , TZda 



-)>l-6 



(33) 
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The theorem follows by recognizing the fact rk(VFt) = m ■ rk(W/) and 



E 



T m 



min V Vf/(i 



t=l i=l 



< minE 



min T • E = T min rk{w) . 



C Proofs of the Privacy-Preserving Results 
C.1 Proof of Theorem 2

A path $p$ from node $i_0$ to node $i_r$ is a sequence of nodes $i_0, i_1, \ldots, i_r$ such that
$(i_j, i_{j+1})$ is an edge for every $0 \le j < r$. Two paths $p_1$ and $p_2$ are disjoint if
they have no common nodes. A set of paths is disjoint if the paths are pairwise
disjoint. Let $X_1$ and $X_2$ be two sets of nodes; an $r$-linking between $X_1$ and
$X_2$ is a set of $r$ disjoint paths that start in $X_1$ and end in $X_2$ (Sundaram &
Hadjicostis 2009).

We apply Theorem 1 in Sundaram & Hadjicostis 2009 to the system described
by $S$, where $M$ can reconstruct the input, i.e. the gradients $G_t$, if and only if
there exists an $m$-linking from all nodes $\{1, 2, \ldots, m\}$ to $M$ and its neighbors.
However, this is only possible when all other nodes are neighbors of $M$ and the
paths of the $m$-linking are the nodes themselves.



C.2 Proof of Theorem 3

As sufficiency is straightforward, we prove the necessity here. For a sequence
of real vectors $\{A_t\}_{t=1}^{\infty}$, the one-sided z-transform is defined as

$$A(z) = \sum_{t=0}^{\infty} z^{-t} A_{t+1}, \quad z \in \mathbb{C}, \qquad (34)$$

where $A(z)$ is well-defined in the complex plane except on a disk centered at
zero. The relation between the z-transforms of the variables in the system $S$
(equivalently $S'$) is

$$Y(z) = W_1 (zI - A)^{-1} C z + G(z) \underbrace{(zI - A)^{-1} C}_{\text{transfer matrix}}, \qquad (35)$$

where the transfer matrix of the system $S'$ is defined as

$$T(z) = (zI - A)^{-1} C = \begin{bmatrix} B_{\mathcal{N}} (zI - A)^{-1} C \\ B_{\mathcal{U}} (zI - A)^{-1} C \end{bmatrix} = \begin{bmatrix} T_{\mathcal{N}}(z) \\ T_{\mathcal{U}}(z) \end{bmatrix}. \qquad (36)$$

Each element of $T(z)$ is a rational function, and the matrix rank is taken over the
field of rational expressions. We further assume that $W_1 = 0$; otherwise it may be
absorbed into the first input $G_1$. Readers may find a more detailed description
of the above definitions and concepts in standard textbooks on modern control
theory, e.g. Brogan 1991. The proof is divided into two steps.

Step 1: Supposing $|\mathcal{N}| = r$, we first show that if $Y_t$ determines a unique
sequence of inputs to the nodes in $\mathcal{N}$, we must have

$$\operatorname{rank}\left( \begin{bmatrix} T_{\mathcal{N}}(z) \\ T_{\mathcal{U}}(z) \end{bmatrix} \right) - \operatorname{rank}(T_{\mathcal{U}}(z)) = r. \qquad (37)$$

We prove this by contradiction. Suppose (37) does not hold. Then there exists
at least one row of $T_{\mathcal{N}}(z)$ that is linearly dependent on the other rows of $T(z)$.
Let $T_{\mathcal{N},i}(z)$ be this linearly dependent row. Then, there exists a vector $G(z)$,
with the $i$-th element nonzero, such that $G(z) T(z) = 0$. This corresponds to a
nonzero input at one of the nodes in $\mathcal{N}$, but the output $Y_t$ is zero for all time,
and thus this nonzero input cannot be recovered.



Step 2: We relate the rank condition (37) to the topology of the communication
graph in this step and complete the proof.

Let us denote the set of the neighbor nodes of $M$ as $\mathcal{P}$. According to Sundaram
& Hadjicostis 2009 and Dion et al. 2003, the rank of the transfer
matrix of $S'$ can be analyzed under the framework of structured systems. Given
a graph, for any choice of nonzero elements in $A$ except for a set of measure
zero,

$\operatorname{rank}(T(z))$ = max. # of vertex disjoint paths from all nodes to $\{M\} \cup \mathcal{P}$,
$\operatorname{rank}(T_{\mathcal{U}}(z))$ = max. # of vertex disjoint paths from $\mathcal{U}$ to $\{M\} \cup \mathcal{P}$.

It is obvious that $\operatorname{rank}(T(z)) = \deg(M) + 1$, where $\deg(M)$ is the degree of $M$, as
we may choose the vertex disjoint paths to be the nodes in $\{M\} \cup \mathcal{P}$ themselves.
We denote $\operatorname{rank}(T_{\mathcal{U}}(z)) = u$. The rank condition (37) reads

$$\deg(M) + 1 = r + u. \qquad (38)$$

First, partition the set $\{M\} \cup \mathcal{P}$ as $\{M\} \cup \mathcal{P} - \mathcal{N}$ and $\{\{M\} \cup \mathcal{P}\} \cap \mathcal{N}$. Thus

$$\deg(M) + 1 = |\{M\} \cup \mathcal{P}| = |\{M\} \cup \mathcal{P} - \mathcal{N}| + |\{\{M\} \cup \mathcal{P}\} \cap \mathcal{N}|. \qquad (39)$$

Now, if $\mathcal{N}$ is not contained in $\{M\} \cup \mathcal{P}$, then we have $|\{\{M\} \cup \mathcal{P}\} \cap \mathcal{N}| < |\mathcal{N}| = r$.
Furthermore, since $\{M\} \cup \mathcal{P} - \mathcal{N}$ is a subset of $\mathcal{U}$, we have $u \ge |\{M\} \cup \mathcal{P} - \mathcal{N}|$.
Thus we would have $\deg(M) + 1 < u + r$, which contradicts (38). Thus we must
have $\mathcal{N}$ being a subset of $\{M\} \cup \mathcal{P}$.

Next, suppose that some node in $\mathcal{N}$ has a neighbor in $\mathcal{U}$ that is not also
in $\{M\} \cup \mathcal{P}$. Then we have $u > |\{M\} \cup \mathcal{P} - \mathcal{N}|$, and since $|\mathcal{N}| = r$ (which
means that $|\{\{M\} \cup \mathcal{P}\} \cap \mathcal{N}| = |\mathcal{N}| = r$), we have $\deg(M) + 1 < u + r$, which
again contradicts (38). Thus, no node in $\mathcal{N}$ can have a neighbor that is not in
$\{M\} \cup \mathcal{P}$.
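The two graph conditions reached in Step 2 are simple to test on a given topology: a node can only be exposed if it is a neighbor of the malicious node $M$ and all of its own neighbors lie in $\{M\} \cup \mathcal{P}$. The sketch below applies this check to adjacency sets; the function name and the dictionary-of-sets graph representation are assumptions, and the check is a direct reading of the conditions above rather than an implementation of the rank test itself.

```python
def exposed_nodes(neighbors, malicious):
    """Return the neighbors of `malicious` whose whole neighborhood it observes.

    neighbors maps each node to the set of nodes adjacent to it.
    """
    observed = set(neighbors[malicious]) | {malicious}
    return {v for v in neighbors[malicious] if set(neighbors[v]) <= observed}


def clique(m):
    """Adjacency sets of a clique on m nodes."""
    return {v: set(range(m)) - {v} for v in range(m)}


def ring(m):
    """Adjacency sets of a ring on m nodes."""
    return {v: {(v - 1) % m, (v + 1) % m} for v in range(m)}
```

On a clique every other node is exposed, whereas on a ring with five or more nodes each neighbor of $M$ has a second neighbor that $M$ does not observe, so `exposed_nodes` returns the empty set.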



C.3 Proof of Theorem M 

The inputs are Gt and Rt- Let Biji = [B^, Bjj-, B'^]'^ the transfer matrix of S" 



IS 



-{zl 






'Bu{zl 








{zl 






Bu'izI 






Tu'{z) 



T{z) 



Similar to Step 1 in the proof of Theorem 3, the output sequence $Y_t$ determines
a unique sequence of subgradient inputs $G_t$ to the nodes in $\mathcal{N}$ if and only if

$$\operatorname{rank}\left( \begin{bmatrix} T_{\mathcal{N}}(z) \\ T_{\mathcal{U}'}(z) \end{bmatrix} \right) - \operatorname{rank}(T_{\mathcal{U}'}(z)) = r. \qquad (40)$$

Next, we relate the rank condition (40) to the topological property of the communication
graph. We construct a directed graph $C'(A)$ by adding two input
nodes $i_g$ and $i_r$ for each node (learner) $i$ in the communication graph $C(A)$ and
two edges $(i_g, i)$ and $(i_r, i)$. The two input nodes correspond to $\eta_t g_t^i$ and
$r_t^i$ respectively. The definition of $B_{\mathcal{U}'}$ suggests the following consistent definition:

$$\mathcal{U}' = \{ i_g \mid i \in \mathcal{U} \} \cup \{ i_r \}. \qquad (41)$$

Let us denote the set of neighbor nodes of $M$ as $\mathcal{P}$. According to Dion et al.
2003, for almost any choice of $A$, the ranks of the transfer matrices $T(z)$ and
$T_{\mathcal{U}'}(z)$ are

$\operatorname{rank}(T(z))$ = max. # of vertex disjoint paths from all input nodes to $\{M\} \cup \mathcal{P}$,
$\operatorname{rank}(T_{\mathcal{U}'}(z))$ = max. # of vertex disjoint paths from $\mathcal{U}'$ to $\{M\} \cup \mathcal{P}$.

For each vertex disjoint path starting from $i_g$ with $i \in \mathcal{N}$, replacing $i_g$ with $i_r$ also
forms a vertex disjoint path. We can conclude that $\operatorname{rank}(T(z)) = \operatorname{rank}(T_{\mathcal{U}'}(z))$.
Therefore the sequence Yt cannot determine a unique sequence of subgradients 
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